Strategies of Processing Japanese Names and Character Variants in Traditional Chinese Text

Authors

Chuan-Jie Lin

Jia-Cheng Zhan

Yen-Heng Chen

Chien-Wei Pao

Abstract

This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji into the corresponding Traditional Chinese characters. The same method can also be used to detect words written in character variants. After integrating generation rules for various types of special words, as well as their probability models, the F-measure of our word segmentation system rises from 94.16% to 96.06%. Another experiment shows that 83.18% of the 862 Japanese names in a set of 109 human-annotated documents can be successfully detected.
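The abstract does not say how the Kanji-to-Traditional-Chinese mapping is built, so what follows is only an illustrative sketch, not the authors' method. It assumes a small hand-written lookup table (the `KANJI_TO_TRADITIONAL` dictionary and both function names are hypothetical) and shows how one table could serve both purposes mentioned above: normalizing Japanese Kanji to Traditional Chinese characters and flagging text that contains variant characters.

```python
# Hypothetical, hand-picked sample of Japanese simplified Kanji (shinjitai)
# paired with their Traditional Chinese forms. A real system would need a
# far larger table; this one exists only for illustration.
KANJI_TO_TRADITIONAL = {
    "国": "國",
    "沢": "澤",
    "広": "廣",
    "辺": "邊",
    "栄": "榮",
}

# Precompute a translation table so normalization is a single pass.
_TABLE = str.maketrans(KANJI_TO_TRADITIONAL)


def normalize_variants(text: str) -> str:
    """Replace every known variant character with its Traditional form."""
    return text.translate(_TABLE)


def contains_variant(text: str) -> bool:
    """Return True if any character in text is a known variant."""
    return any(ch in KANJI_TO_TRADITIONAL for ch in text)


if __name__ == "__main__":
    name = "渡辺"
    if contains_variant(name):
        print(name, "->", normalize_variants(name))
```

A character-by-character table like this cannot handle one-to-many mappings or context-dependent variants; those cases would need a candidate list per character plus some scoring, which is presumably where a probability model such as the one described in the abstract would come in.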


Similar resources

Extending Huffman Coding for Multilingual Text Compression

Traditional text compression algorithms such as Huffman and LZ variants are usually based on 8-bit character sampling. However, under the Unicode representation for multilingual information, the character set of a language such as Chinese or Japanese consists of a very large number of distinct characters, and thus 16-bit or 32-bit character sampling is needed. Consequently, when text compress...


Comparison of the Impact of Word Segmentation on Name Tagging for Chinese and Japanese

Word segmentation is usually considered an essential step for many Chinese and Japanese natural language processing tasks, such as name tagging. This paper presents several new observations and analyses on the impact of word segmentation on name tagging: (1) Due to the limitation of current state-of-the-art Chinese word segmentation performance, a character-based name tagger can outperform its...


Spoken Correction for Chinese Text Entry

With an average of 17 Chinese characters per phonetic syllable, correcting conversion errors with current phonetic input method editors (IMEs) is often painstaking and time consuming. We explore the application of spoken character description as a correction interface for Chinese text entry, in part motivated by the common practice of describing Chinese characters in names for self-introduction...


FREQUENCY OF C3435 MDR1 AND A6896G CYP3A5 SINGLE NUCLEOTIDE POLYMORPHISM IN AN IRANIAN POPULATION AND COMPARISON WITH OTHER ETHNIC GROUPS

ABSTRACT Background: It is well recognized that different patients respond in different ways to medications. The inter-individual variations are greater than the intra-individual variations, a finding consistent with the notion that inheritance is a determinant of drug responses. The recent identification of genetic polymorphisms in drug-metabolizing enzymes and drug transporters led to the ...


Character-Level Dependencies in Chinese: Usefulness and Learning

We investigate the possibility of exploiting character-based dependency for Chinese information processing. As Chinese text is made up of character sequences rather than word sequences, the word is not as natural a concept in Chinese as it is in English, nor is it easy to define without dispute in such a language. Therefore we propose a character-level dependency scheme to represent primary lingu...



Journal title:

IJCLCLP

Volume 17

Publication date: 2012
